In [1]:
import math
import random
import matplotlib.pyplot as plt

In [2]:
%%capture
%run "6 - Probability.ipynb"

Statistical Hypothesis Testing

Classical hypothesis testing involves a null hypothesis $H_0$ that represents some default position and an alternative hypothesis $H_1$ that we want to compare it against. We use statistics to decide whether we can reject $H_0$ as false or not.

Example: Flipping A Coin

Null hypothesis: the coin is fair, i.e., $p = 0.5$. Alternative hypothesis: the coin is not fair, i.e., $p \neq 0.5$.

To test this, we will flip the coin $n$ times. Each flip is a Bernoulli trial with parameter $p$, so the number of heads $X$ is a Binomial($n$, $p$) random variable, which we can approximate with a normal distribution.
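
As a quick illustration (my sketch, not part of the original notebook), a single simulated experiment draws the head count as a sum of Bernoulli flips:

In [ ]:
import random

# one simulated experiment: the sum of 1,000 Bernoulli(0.5) flips
# is a single draw from Binomial(1000, 0.5)
num_heads = sum(1 if random.random() < 0.5 else 0 for _ in range(1000))
num_heads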


In [5]:
def normal_approximation_to_binomial(n, p):
    """return mu and sigma corresponding to Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

In [3]:
normal_probability_below = normal_cdf

def normal_probability_above(lo, mu=0, sigma=1):
    """the probability that an N(mu, sigma) variable is above lo"""
    return 1 - normal_cdf(lo, mu, sigma)

def normal_probability_between(lo, hi, mu=0, sigma=1):
    """the probability that an N(mu, sigma) variable is between lo and hi"""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

def normal_probability_outside(lo, hi, mu=0, sigma=1):
    """the probability that an N(mu, sigma) variable is not between lo and hi"""
    return 1 - normal_probability_between(lo, hi, mu, sigma)

def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2
    
    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)

    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    
    return lower_bound, upper_bound
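
As a quick sanity check (my addition, relying on the inverse_normal_cdf defined in the Probability notebook), the symmetric 95% bounds for a standard normal should come out close to $\pm 1.96$:

In [ ]:
normal_two_sided_bounds(0.95) # approximately (-1.96, 1.96)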

Say we choose to flip the coin $n = 1000$ times. If our null hypothesis is true, $X$ should be approximately normally distributed with a mean of 500 and a standard deviation of about 15.8:


In [10]:
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
mu_0, sigma_0


Out[10]:
(500.0, 15.811388300841896)

A decision must be made about significance: how willing are we to make a type 1 error ("false positive"), in which we reject $H_0$ even though it is true? This is conventionally set to 5% or 1%. We will use 5%:


In [11]:
normal_two_sided_bounds(0.95, mu_0, sigma_0)


Out[11]:
(469.01026640487555, 530.9897335951244)

Assuming $p$ really equals 0.5 (i.e., $H_0$ is true), $X$ will fall outside the interval above only 5% of the time. We can also determine the power of a test: the probability of not making a type 2 error ("false negative"), in which we fail to reject $H_0$ even though it is false. To measure this we have to specify what exactly $H_0$ being false means; here we check what happens if $p$ is actually 0.55, so the coin is slightly biased toward heads:


In [12]:
# 95% bounds based on assumption p is 0.5
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)

# actual mu and sigma based on p = 0.55
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)

# a type 2 error means we fail to reject the null hypothesis
# which will happen when X is still in our original interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability # 0.887

If instead our null hypothesis were that the coin is not biased toward heads ($p \le 0.5$), we would use a one-sided test at 5% significance: find the cutoff below which 95% of the probability lies and reject only for values above it:


In [13]:
hi = normal_upper_bound(0.95, mu_0, sigma_0)

# is 526 (< 531, since we need more probability in the upper tail)
type_2_probability = normal_probability_below(hi, mu_1, sigma_1)

power = 1 - type_2_probability # 0.936
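
An alternative approach is to compute p-values: instead of choosing bounds in advance, we compute the probability, assuming $H_0$ is true, of seeing a value at least as extreme as the one actually observed. (Below, 529.5 is used rather than 530 as a continuity correction for having observed 530 heads.)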

In [15]:
def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is what's greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is what's less than x
        return 2 * normal_probability_below(x, mu, sigma)

two_sided_p_value(529.5, mu_0, sigma_0) # 0.062


Out[15]:
0.06207721579598857
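
One way to convince ourselves that 0.062 is a sensible estimate is with a simulation: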

In [17]:
extreme_value_count = 0
for _ in range(100000):
    num_heads = sum(1 if random.random() < 0.5 else 0 # count # of heads
                    for _ in range(1000)) # in 1000 flips
    if num_heads >= 530 or num_heads <= 470: # and count how often
        extreme_value_count += 1 # the # is 'extreme'

print(extreme_value_count / 100000) # 0.062


0.06233
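
Had we instead observed 532 heads, the p-value would fall below 5% and we would reject the null hypothesis: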

In [18]:
two_sided_p_value(531.5, mu_0, sigma_0) # 0.0463


Out[18]:
0.046345287837786575

Make sure your data is roughly normally distributed before using normal_probability_above to compute p-values. The annals of bad data science are filled with examples of people opining that the chance of some observed event occurring at random is one in a million, when what they really mean is “the chance, assuming the data is distributed normally,” which is pretty meaningless if the data isn’t.

There are various statistical tests for normality, but even plotting the data is a good start.
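
For example (my sketch, repeating the coin-flip simulation and plotting the head counts), a histogram that looks roughly bell-shaped is reassuring:

In [ ]:
# simulate the head-count statistic many times and eyeball its distribution
draws = [sum(1 if random.random() < 0.5 else 0 for _ in range(1000))
         for _ in range(1000)]
plt.hist(draws, bins=30)
plt.title("Heads in 1,000 fair-coin flips")
plt.show()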

Confidence Intervals

We can also construct a confidence interval around the observed value of a parameter. Suppose we observe 525 heads out of 1,000 flips, so we estimate $\hat{p} = 0.525$. If we knew the true $p$, the central limit theorem would give the estimate a standard error of $\sqrt{p (1 - p) / 1000}$, but we don't know $p$:


In [19]:
math.sqrt(p * (1 - p) / 1000)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-19-4f57f5ecb6ba> in <module>()
----> 1 math.sqrt(p * (1 - p) / 1000)

NameError: name 'p' is not defined

Not knowing $p$, we use our estimate $\hat{p}$ instead:


In [21]:
p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
sigma


Out[21]:
0.015791611697353755

Assuming a normal distribution, we conclude that we are 95% confident that the interval below includes the true $p$:


In [22]:
normal_two_sided_bounds(0.95, mu, sigma)


Out[22]:
(0.4940490278129096, 0.5559509721870904)
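
This matches a by-hand calculation (my arithmetic, using the standard normal 97.5th percentile $z \approx 1.96$): $\hat{p} \pm 1.96 \hat{\sigma} = 0.525 \pm 1.96 \times 0.0158 \approx (0.494, 0.556)$.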

0.5 falls within our interval so we do not conclude that the coin is unfair.

What if we observe 540 heads though?


In [23]:
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000) # 0.0158
normal_two_sided_bounds(0.95, mu, sigma) # [0.5091, 0.5709]


Out[23]:
(0.5091095927295919, 0.5708904072704082)

In this scenario, 0.5 falls outside our interval, so we would not accept the "fair coin" hypothesis: it fails a test that it would pass 95% of the time if it were true.

P-hacking

P-hacking means using various "hacks" to push a p-value below 0.05: testing a large number of hypotheses and reporting only the significant ones, selectively removing outliers, and so on.


In [24]:
def run_experiment():
    """flip a fair coin 1000 times, True = heads, False = tails"""
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment):
    """using the 5% significance levels"""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531

random.seed(0)
experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([experiment for experiment in experiments if reject_fairness(experiment)])
num_rejections # 46


Out[24]:
46

Valid inferences come from a priori hypotheses (hypotheses created before collecting any data) and data cleansing without reference to the hypotheses.

Example: Running An A/B Test
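
Here $N_A$ and $N_B$ are the number of trials for each version and $n_A$ and $n_B$ the observed successes. Each estimate $\hat{p} = n / N$ is approximately normal, so if the two versions are equally effective, the statistic $z = (\hat{p}_B - \hat{p}_A) / \sqrt{\sigma_A^2 + \sigma_B^2}$ should be approximately a standard normal: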


In [26]:
def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)

z = a_b_test_statistic(1000, 200, 1000, 180)
z


Out[26]:
-1.1403464899034472

In [27]:
two_sided_p_value(z)


Out[27]:
0.254141976542236
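
This p-value is large enough that we cannot conclude there is a real difference between the two versions. But suppose version B had produced only 150 successes: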

In [28]:
z = a_b_test_statistic(1000, 200, 1000, 150)
two_sided_p_value(z)


Out[28]:
0.003189699706216853
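
Now the p-value is small enough that, at the 5% level, we would reject the hypothesis that the two versions are equally effective.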

Bayesian Inference

In Bayesian inference we start with a prior distribution for the parameters and then use the observed data to obtain a posterior distribution for those same parameters. So instead of making probability judgments about the tests, we make probability judgments about the parameters themselves.

When the unknown parameter is itself a probability (as with our coin's heads probability), a convenient prior comes from the Beta distribution, which puts all of its weight between 0 and 1:


In [29]:
def B(alpha, beta):
    """a normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x < 0 or x > 1: # no weight outside of [0, 1]
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)
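
The Beta distribution is convenient because it is conjugate to coin flipping: if the prior is Beta($\alpha$, $\beta$) and we then observe $h$ heads and $t$ tails, the posterior is Beta($\alpha + h$, $\beta + t$). A minimal sketch (my example, using a made-up 13 heads in 20 flips and a uniform Beta(1, 1) prior):

In [ ]:
# conjugate update: a Beta(alpha, beta) prior plus h heads and t tails
# gives a Beta(alpha + h, beta + t) posterior
xs = [i / 100 for i in range(1, 100)]
h, t = 13, 7 # hypothetical: 13 heads in 20 flips
plt.plot(xs, [beta_pdf(x, 1, 1) for x in xs], label="Beta(1, 1) prior")
plt.plot(xs, [beta_pdf(x, 1 + h, 1 + t) for x in xs], label="Beta(14, 8) posterior")
plt.legend()
plt.show()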